Detecting OOV Named-Entities in Conversational Speech

Authors

  • Rohit Kumar
  • Rohit Prasad
  • Sankaranarayanan Ananthakrishnan
  • Aravind Namandi Vembu
  • David Stallard
  • Stavros Tsakalidis
  • Premkumar Natarajan

Abstract

A common cause of errors in spoken language systems is the presence of out-of-vocabulary (OOV) words in the input. Named entities (people, places, organizations, etc.) are a particularly important class of OOVs. In this paper we focus on detecting OOV named entities (NEs) for two-way English/Iraqi speech-to-speech translation. Our approach builds on a Maximum Entropy (MaxEnt) classifier trained on a suite of contextual features. These features include n-gram context, part-of-speech tags (both supervised and unsupervised), and word posterior features computed from the trajectory of the word posteriors within the utterance. Our experimental results show that fusion (both early and late) of these novel word posterior features with the rest of the contextual features significantly improves detection accuracy for OOV NEs. However, we also observe that the same features that perform well on OOV NEs can hurt the detection of in-vocabulary NEs. Therefore, the choice of features should be based on the expected occurrence of OOV NEs.
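The distinction between early and late fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact feature definitions or the fusion rule, so everything below is an assumption made for illustration: the function names, the particular trajectory statistics (change in posterior relative to neighbouring words, utterance minimum), and the weighted-average rule for late fusion are all hypothetical.

```python
from typing import Dict, List


def context_features(words: List[str], i: int) -> Dict[str, float]:
    """Indicator features for the word at position i and its neighbours."""
    prev_w = words[i - 1] if i > 0 else "<s>"
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"
    return {
        f"w={words[i]}": 1.0,
        f"prev={prev_w}": 1.0,
        f"next={next_w}": 1.0,
    }


def posterior_features(posteriors: List[float], i: int) -> Dict[str, float]:
    """Hypothetical trajectory features built from ASR word posteriors.

    Assumption: the posterior at i, its change relative to each neighbour,
    and the utterance-level minimum. The paper's actual features may differ.
    """
    prev_p = posteriors[i - 1] if i > 0 else 1.0
    next_p = posteriors[i + 1] if i + 1 < len(posteriors) else 1.0
    return {
        "post": posteriors[i],
        "post_delta_prev": posteriors[i] - prev_p,
        "post_delta_next": posteriors[i] - next_p,
        "post_utt_min": min(posteriors),
    }


def early_fusion(words: List[str], posteriors: List[float], i: int) -> Dict[str, float]:
    """Early fusion: merge both feature sets into one vector for a single classifier."""
    feats = dict(context_features(words, i))
    feats.update(posterior_features(posteriors, i))
    return feats


def late_fusion(p_context: float, p_posterior: float, weight: float = 0.5) -> float:
    """Late fusion: combine the scores of two separately trained classifiers.

    Assumption: a simple weighted average of the two probabilities.
    """
    return weight * p_context + (1.0 - weight) * p_posterior


# A made-up ASR hypothesis in which an OOV name was misrecognized as "bus raw".
words = ["we", "met", "in", "bus", "raw"]
posteriors = [0.9, 0.8, 0.9, 0.3, 0.4]
fused = early_fusion(words, posteriors, 3)
combined = late_fusion(0.2, 0.8)
```

In early fusion a single classifier sees the merged feature dictionary, whereas in late fusion each feature family feeds its own classifier and only the output scores are combined.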

Similar resources

OOV Sensitive Named-Entity Recognition in Speech

Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named e...

Source-Error Aware Phrase-Based Decoding for Robust Conversational Spoken Language Translation

Spoken language translation (SLT) systems typically follow a pipeline architecture, in which the best automatic speech recognition (ASR) hypothesis of an input utterance is fed into a statistical machine translation (SMT) system. Conversational speech often generates unrecoverable ASR errors owing to its rich vocabulary (e.g. out-of-vocabulary (OOV) named entities). In this paper, we study the ...

Variable-Span out-of-vocabulary named entity detection

Out-of-vocabulary named entities (OOV NEs) are always misrecognized by fixed-vocabulary automatic speech recognition (ASR) systems. This has a negative impact on downstream applications such as language understanding and machine translation (MT). Automatic detection of OOV NEs in ASR hypotheses can help mitigate this problem by triggering the use of alternative approaches to acquire and process...

Named entity tagged language models

We introduce Named Entity (NE) Language Modelling, a stochastic finite state machine approach to identifying both words and NE categories from a stream of spoken data. We provide an overview of our approach to NE tagged language model (LM) generation together with results of the application of such a LM to the task of out-of-vocabulary (OOV) word reduction in large vocabulary speech recognition. ...

THE JOHNS HOPKINS UNIVERSITY Sub-Lexical and Contextual Modeling of Out-of-Vocabulary Words in Speech Recognition

Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. We present a novel probabilistic model to l...

Publication date: 2012